Prompt

Describe an R package that you use regularly. What are the most useful class/methods/functions? What are the limitations, gotchas, bugs in the package? Can you white-board a strategy that might work to improve the package? We are interested in how well you know your tools and how interested you are in improving stuff you use.

Submit your notes, code blocks, explanation/demo as a markdown file or RStudio/RMarkdown to a public github repo and submit the url for review.

Description

Overview

The R package I use most frequently is Seurat. Seurat is a toolkit for single-cell genomics analysis developed by Rahul Satija’s lab at the New York Genome Center and one of the most popular analysis packages due its excellent documentation and thorough vignettes. It also has the added benefit of being based in R, which many biologists and statisticians are familiar with. There are of course, other powerful and popular analysis packages/workflows such as:

I believe Seurat is a popular option for single-cell analysis, not only because of the informative vignettes, but because the development team are considerate of the way that most people, not only experts, will interact with and use their package. In this way, I see Seurat as having the lowest barrier of entry, and one of the best places for beginners to start their journey with single-cell analysis. Their github issues page is actually a welcoming place, where many questions are answered not only about the inter workings of the library, but also about experimental design and best practices, e.g. #2461, currently they have over 5500 closed issues.

Seurat also offers a number of built-in features such as, multi-modal data analysis, and the ability to analyze spatial data sets from 10x Visium and Slide-seq formats. They also have excellent options for integration, including the ability to integrate data from different modalities, and to annotate query data with pre-annotated references, aka “reference mapping”.

Loading Seurat and example data

library(Seurat)
library(SeuratData)
library(SeuratDisk)
library(tidyverse)
library(knitr)
library(scater)
library(reticulate)
library(ggrepel)
InstallData("pbmc3k")
data("pbmc3k")
pbmc3k
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

Seurat Class

Object structure

The Seurat object is at the heart of the package, it contains all the necessary representations of the expression data as well as cell-level and feature-level metadata.

Metadata

The meta.data slot contains cell-level data, such as the total counts (nCounts) and unique features for each cell (nFeatures). It is also an convenient place to store new information, such as cluster annotations or gene signature scores.

head(ifnb@meta.data)
##                   orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
## AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
## AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
## AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
## AAACATACGGCATT.1 IMMUNE_CTRL       1581          557 CTRL          CD14 Mono

Assays

Assays are individual representations of the expression data and are organized as (feature x observation) matrices (e.g. genes x cells). For example, the “RNA assay” is the default assay after object creation and will contain the raw count matrix. If sctransform normalization is performed on a data set, a new assay is created, the “SCT assay”. Similarly if integration is performed, an “Integrated assay” is created.

Assay slots

Within each assay, transformations of the data exist as “slots”. It’s important to note, that based on your workflow (traditional normalization v. sctransform normalization) these slots will be different types transformations (i.e. uncorrected v. corrected).

  • scale.data slot is a scaled matrix used for dimensional reduction and heatmaps.

  • data slot is normalized data used for differential gene expression, marker ID, etc.

  • counts slot is unnormalized raw count data.

Basic workflow

The basic Seurat workflow based on their popular vignette.

# Calculate the % of mitrochondrial reads/cell, store this in the metadata
pbmc3k[["percent.mt"]] <- PercentageFeatureSet(pbmc3k, pattern = "^MT-")

# Filter out low-quality cells
pbmc3k <- subset(pbmc3k, subset = nFeature_RNA > 200 &
                 nFeature_RNA < 2500 & percent.mt < 5)

# Log-normalize, identify highly variable genes, scale and run PCA
pbmc3k <- pbmc3k %>%  NormalizeData() %>% FindVariableFeatures() %>%
  ScaleData() %>% RunPCA()

# Run clustering (graph-based with louvain algorithm)
pbmc3k <- FindNeighbors(pbmc3k, dims = 1:10)
pbmc3k <- FindClusters(pbmc3k, resolution = 0.5)

# Run Non-linear dimensionality reduction
pbmc3k <- RunUMAP(pbmc3k, dims = 1:10)

# Plot
DimPlot(pbmc3k, label = T)

Useful features

Options for integration

In our research, we use scRNA-seq as a comparative method to explore differences in cell populations between wild-type and genetically modified mice. Therefore data integration is essential to ensure we are comparing shared cell populations across conditions, as well as removing unwanted batch effects.

Often, in our experimental framework, the difference between conditions (e.g. WT v. KO) isn’t very dramatic. So, we need workflows that allow us to retain biological heterogeneity across conditions, without overly-integrating the data and potentially hiding important differences. This of course puts a more weight on the importance of validating our findings with other methods, but we are happy to do this. Luckily, Seurat provides multiple methods for data integration, that can be applied based on the experimental design or magnitude of response between control and experimental groups.

The Seurat developers make an important note that their default integration workflow based on canonical correlation analysis (CCA), may in some cases lead to over-correction, pulling together cell states across conditions that shouldn’t be aligned. Their alternative, reciprocal PCA (RPCA) based integration, is a more conservative method and is less likely to align cells in different biological states. In this method, each data set is projected into each others PCA space, and the anchors are then constrained by the same mutual neighborhood requirement. When designating a reference dataset, the comparisons are then limited to just the reference vs query and anchors are not identified between pairs of query datasets, yielding similar results with less compute time. The reference-based RPCA approach is well suited to our work.

Using the built in ifnb data set, which consists of control and interferon-stimulated PBMC, the following is a rudimentary comparison of workflows to highlight the difference in computational time and integration “strength” between simple merging, CCA integration and reference-based RPCA integration. Here we only have one control and one query dataset for integration, but if we had replicates of each treatment, reference-based RPCA would save even more time.

# Install ifnb data with the SeuratData package
InstallData("ifnb")

# Merge based workflow
data("ifnb")
merge_start <- Sys.time()
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)
variable_genes <- SelectIntegrationFeatures(object.list = ifnb.list)
ifnb <- merge(x = ifnb.list[[1]], y = ifnb.list[[2]])
VariableFeatures(ifnb) <- variable_genes
ifnb <- RunPCA(ifnb, npcs = 50)
ifnb <- RunUMAP(ifnb, reduction = "pca", dims = 1:25)
ifnb <- FindNeighbors(ifnb, dims = 1:25)
ifnb <- FindClusters(ifnb, resolution = 0.4)
merge_end <- Sys.time()
p1 <- DimPlot(ifnb, group.by = "stim") + labs(subtitle = "Merged") +
  theme(plot.title = element_blank(), legend.position = "none")

# CCA integration workflow
data("ifnb")
cca_start <- Sys.time()
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)
features <- SelectIntegrationFeatures(object.list = ifnb.list)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list,
                                anchor.features = features)
anchors <- FindIntegrationAnchors(object.list = ifnb.list,
                                  normalization.method = "SCT",
                                  anchor.features = features,
                                  dims = 1:25)
ifnb <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
ifnb <- RunPCA(ifnb, npcs = 50)
ifnb <- RunUMAP(ifnb, reduction = "pca", dims = 1:25)
ifnb <- FindNeighbors(ifnb, dims = 1:25)
ifnb <- FindClusters(ifnb, resolution = 0.4)
cca_end <- Sys.time()
p2 <- DimPlot(ifnb, group.by = "stim") + labs(subtitle = "CCA") +
  theme(plot.title = element_blank(), legend.position = "none")

# Reference-based RPCA integration workflow
data("ifnb")
rpca_start <- Sys.time()
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)
features <- SelectIntegrationFeatures(object.list = ifnb.list)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)
ifnb.list <- lapply(X = ifnb.list, FUN = RunPCA, features = features)
anchors <- FindIntegrationAnchors(object.list = ifnb.list,
                                  normalization.method = "SCT",
                                  reference = 1, # index of control obj in ifnb.list
                                  anchor.features = features,
                                  dims = 1:25,
                                  reduction = "rpca",
                                  k.anchor = 5)
ifnb <- IntegrateData(anchorset = anchors,
                      normalization.method = "SCT",
                      dims = 1:25)
ifnb <- RunPCA(ifnb, npcs = 50)
ifnb <- RunUMAP(ifnb, reduction = "pca", dims = 1:25)
ifnb <- FindNeighbors(ifnb, dims = 1:25)
ifnb <- FindClusters(ifnb, resolution = 0.4)
rpca_end <- Sys.time()
p3 <- DimPlot(ifnb, group.by = "stim") + labs(subtitle = "Reference-based RPCA") +
  theme(plot.title = element_blank())
# Compute time comparison
df <- data.frame(method=c("Merged", "CCA", "RPCA"),
           compute_time= c(merge_end-merge_start,
                           cca_end-cca_start,
                           rpca_end-rpca_start))
print(df)
##   method  compute_time
## 1 Merged 1.408759 mins
## 2    CCA 4.330760 mins
## 3   RPCA 1.999046 mins
# Integration comparison
p1 + p2 + p3 

Scoring gene programs

AddModuleScore() is a very useful function to calculate the average expression for a list of genes, this is great for showing gene programs across clusters.

# Log normalize and scale the RNA assay
DefaultAssay(ifnb) <- "RNA"
ifnb <- ifnb %>% NormalizeData() %>% ScaleData()

# Read in "response to interferon" gene ontology term; biological processes
gene_list <- read_csv(file = "data/gene_signatures.csv")
interferon <- gene_list$GO.0034340_response_to_type_I_interferon

# Genes have to be in list format
interferon <- interferon %>% na.omit() %>% unique() %>% list()

# Calculate the gene module score for each cell, this is stored in the metadata
ifnb <- AddModuleScore(ifnb, features = interferon, ctrl = 20, name = "interferon")
head(ifnb@meta.data)
##                   orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
## AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
## AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
## AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
## AAACATACGGCATT.1 IMMUNE_CTRL       1581          557 CTRL          CD14 Mono
##                  nCount_SCT nFeature_SCT integrated_snn_res.0.4 seurat_clusters
## AAACATACATTTCC.1       2066          847                      3               3
## AAACATACCAGAAA.1       1923          713                      0               0
## AAACATACCTCGCT.1       2074          817                      3               3
## AAACATACCTGGTA.1       2230         1076                     11              11
## AAACATACGATGAA.1       1778          634                      1               1
## AAACATACGGCATT.1       1624          557                      3               3
##                   interferon1
## AAACATACATTTCC.1 -0.330894340
## AAACATACCAGAAA.1 -0.091329258
## AAACATACCTCGCT.1 -0.091645183
## AAACATACCTGGTA.1 -0.246427288
## AAACATACGATGAA.1 -0.188778103
## AAACATACGGCATT.1 -0.003577001
# Visualize the module score expression 
FeaturePlot(ifnb, features = "interferon1", split.by = "stim")

Sub-setting

subset() makes it easy to do sub analysis of certain clusters, or cells based on expression levels. You can subset on any feature in the dataset. Although this isn’t a Seurat function, it works seamlessly with the Seurat objects, which saves a lot of time.

# Subset cluster 0
ifnb_sub <- subset(ifnb, idents = "0")

# Plot the full object, and subset
p1 <- DimPlot(ifnb, label = T)
p2 <- DimPlot(ifnb_sub, label = T)
p1 + p2

# Subset cells that express Versican (VCAN), a hyaluronan-binding proteoglycan
ifnb_vcan <- subset(ifnb, subset = VCAN > 1)
p1 <- FeaturePlot(ifnb, features = "VCAN")
p2 <- FeaturePlot(ifnb_vcan, features = "VCAN")
p1 + p2

Customizing plots

LabelClusters() is a useful function to customize DimPlot labeling, this makes it easy to design graphs that have better interpretability. I find it nice that the developers included several plotting functions like this, because it can be hard to find the important data inside the Seurat object for plotting using ggplot2 alone.

# Create a basic UMAP plot, of the pre-annotated clusters
p <- DimPlot(ifnb,
             reduction = "umap",
             pt.size = .3,
             label = F,
             group.by = "seurat_annotations") +
  xlab("UMAP-1") +
  ylab("UMAP-2") +
  labs(color = "Identity") +
  theme(legend.text = element_text(size = 11),
        legend.title = element_text(size = 15),
        legend.justification = "top",
        legend.key.size = unit(3, "point"),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title = element_text(size = 16),
        plot.title = element_blank()) +
  guides(color = guide_legend(override.aes = list(size = 4.25),
                              nrow = 20))

# Use LabelClusters to customize the appearance of the annotation labels
LabelClusters(plot = p,
              id = "seurat_annotations",
              repel = T,
              force = 0.25,
              box = T,
              fill = alpha("white", 0.45),
              size = 4,
              label.r = unit(0.25, "lines"),
              label.size = NA)

Conserved markers

FindConservedMarkers() is also a very useful function when you have multiple conditions and you would like to find the conserved marker genes in each cluster. For our purposes, this is a useful sanity check that the cluster identities are in fact shared biological states between groups.

Below I find conserved markers in the ifnb dataset, between the control and stimulated PBMC samples, within in the pre-annotated natural killer cell cluster (NK).

# Set Idents to the annotation that comes with the ifnb data set "seurat_annotations"
Idents(ifnb) <- "seurat_annotations"

# Find conserved markers of NK identity versus all other cells in both conditions
NK_markers <- FindConservedMarkers(object = ifnb,
                                   ident.1 = "NK",
                                   grouping.var = "stim")
head(NK_markers)
##        CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
## GNLY            0        6.039290      0.943      0.046              0
## FGFBP2          0        3.234476      0.500      0.021              0
## CLIC3           0        3.473570      0.601      0.024              0
## CTSW            0        3.012749      0.537      0.030              0
## KLRD1           0        2.801028      0.507      0.019              0
## KLRC1           0        2.578886      0.379      0.004              0
##           STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj
## GNLY    0.000000e+00        5.866617      0.956      0.059   0.000000e+00
## FGFBP2 1.674159e-159        2.142570      0.259      0.016  2.352696e-155
## CLIC3   0.000000e+00        3.549589      0.623      0.031   0.000000e+00
## CTSW    0.000000e+00        3.138130      0.592      0.035   0.000000e+00
## KLRD1   0.000000e+00        2.860385      0.555      0.027   0.000000e+00
## KLRC1   0.000000e+00        2.521207      0.374      0.006   0.000000e+00
##             max_pval minimump_p_val
## GNLY    0.000000e+00              0
## FGFBP2 1.674159e-159              0
## CLIC3   0.000000e+00              0
## CTSW    0.000000e+00              0
## KLRD1   0.000000e+00              0
## KLRC1   0.000000e+00              0
# Plot
FeaturePlot(ifnb, features = "GNLY", split.by = "stim")

Object conversion

Seurat also has useful functions to convert back and forth between popular single-cell object classes. as.SingleCellExperiment() and as.Seurat() allow you to convert back and forth between Seurat objects and SingleCellExperiment objects which are used in many Bioconductor packages like scater or miloR.

# Convert the ifnb Seurat object to SingleCellExperiment object
ifnb_sce <- as.SingleCellExperiment(ifnb, assay = "RNA")
ifnb_sce
## class: SingleCellExperiment 
## dim: 14053 13999 
## metadata(0):
## assays(3): counts logcounts scaledata
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(0):
## colnames(13999): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTAAGC.1
##   TTTGCATGGGACGA.1
## colData names(11): orig.ident nCount_RNA ... interferon1 ident
## reducedDimNames(2): PCA UMAP
## mainExpName: RNA
## altExpNames(0):
# Convert back to Seurat object
ifnb_seurat <- as.Seurat(ifnb_sce)
ifnb_seurat
## An object of class Seurat 
## 14053 features across 13999 samples within 1 assay 
## Active assay: RNA (14053 features, 0 variable features)
##  2 dimensional reductions calculated: PCA, UMAP

Using the SeuratDisk library, you can also convert to the AnnData format to use in scanpy. This is also a bit of a “gotcha” because you have to load an associated library to make this happen. It would be nicer to have this as a built-in feature of Seurat itself.

# Save ifnb Seurat object as h5 and covert to AnnData
SaveH5Seurat(ifnb, filename = "results/objects/ifnb.h5Seurat", overwrite = T)
Convert("results/objects/ifnb.h5Seurat", dest = "h5ad", overwrite = T)
# import scanpy library
import scanpy

# Read in the converted Seurat object as AnnData
adata = scanpy.read_h5ad("results/objects/ifnb.h5ad")
adata
## AnnData object with n_obs × n_vars = 13999 × 14053
##     obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'stim', 'seurat_annotations', 'nCount_SCT', 'nFeature_SCT', 'integrated_snn_res.0.4', 'seurat_clusters', 'interferon1'
##     var: 'features'
##     obsm: 'X_umap'

Limitations

Scale and ML integration

The major criticism of Seurat is that it will not be able to scale effectively to large data sets. This is a legitimate problem, as throughput increases and cost decreases for single-cell experiments. Also, when considering the utility of applying single-cell to large-scale drug discovery efforts or the human cell atlas, speed and scale will be essential to processing large datasets. So at a production level, Seurat may have serious limitations.

As the developers of scanpy pointed out, Seurat is somewhere between 5-90x slower than their python package Wolf et al. 2018. Moreover, analysis of single-cell data in python has the added benefit of being able to directly interface with popular and well developed machine learning toolsets such as TensorFlow and Scikit-Learn.

Gotchas & Bugs

Transposed expression matrix

The expression matrices are transposed compared to AnnData. Seurat uses [genes x cells], AnnData use [cells x genes]. This is taken care of by the conversion functions, but annoying nonetheless.

dim(ifnb_seurat)
## [1] 14053 13999
adata.shape
## (13999, 14053)

Module score naming

Addmodulescore() adds a “1” to the input name argument when it’s saved in the metadata. This is annoying because it prevents you from easily looping through a set of genes and simultaneously plotting them using the same input name. This is likely a safety measure that prevents metadata from being overwritten, but is unnecessary, and should be more straight forward to allow easier plotting.

# Read in GO term gene lists
gene_list <- read_csv(file = "data/gene_signatures.csv")

# Loop through each column, adding a module score for each GO term
for (i in colnames(gene_list)) {
  print(paste0("Scoring: ", colnames(gene_list[i])))
  genes <- gene_list %>% pull(i) %>% na.omit() %>% unique() %>% list()
  ifnb <- AddModuleScore(ifnb,
                         features = genes,
                         name = i,
                         assay = "RNA",
                         crtl = 20)
  }
## [1] "Scoring: GO:0007049_cell_cycle"
## [1] "Scoring: GO:0030198_extracellular_matrix_organization"
## [1] "Scoring: GO.0034340_response_to_type_I_interferon"
## [1] "Scoring: GO:0045655_regulation_of_monocyte_differentiation"
head(ifnb@meta.data)
##                   orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
## AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
## AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
## AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
## AAACATACGGCATT.1 IMMUNE_CTRL       1581          557 CTRL          CD14 Mono
##                  nCount_SCT nFeature_SCT integrated_snn_res.0.4 seurat_clusters
## AAACATACATTTCC.1       2066          847                      3               3
## AAACATACCAGAAA.1       1923          713                      0               0
## AAACATACCTCGCT.1       2074          817                      3               3
## AAACATACCTGGTA.1       2230         1076                     11              11
## AAACATACGATGAA.1       1778          634                      1               1
## AAACATACGGCATT.1       1624          557                      3               3
##                   interferon1 GO.0007049_cell_cycle1
## AAACATACATTTCC.1 -0.330894340           0.0002443749
## AAACATACCAGAAA.1 -0.091329258          -0.0133844397
## AAACATACCTCGCT.1 -0.091645183          -0.0019225084
## AAACATACCTGGTA.1 -0.246427288           0.0169803778
## AAACATACGATGAA.1 -0.188778103           0.0167430460
## AAACATACGGCATT.1 -0.003577001          -0.0102648580
##                  GO.0030198_extracellular_matrix_organization1
## AAACATACATTTCC.1                                    0.02668278
## AAACATACCAGAAA.1                                    0.11640372
## AAACATACCTCGCT.1                                    0.03652001
## AAACATACCTGGTA.1                                   -0.04620496
## AAACATACGATGAA.1                                   -0.03353290
## AAACATACGGCATT.1                                    0.04570393
##                  GO.0034340_response_to_type_I_interferon1
## AAACATACATTTCC.1                               -0.27089826
## AAACATACCAGAAA.1                               -0.05460896
## AAACATACCTCGCT.1                               -0.05478965
## AAACATACCTGGTA.1                               -0.18274229
## AAACATACGATGAA.1                               -0.12928279
## AAACATACGGCATT.1                                0.04089427
##                  GO.0045655_regulation_of_monocyte_differentiation1
## AAACATACATTTCC.1                                          0.1880956
## AAACATACCAGAAA.1                                          0.1126325
## AAACATACCTCGCT.1                                          0.1026128
## AAACATACCTGGTA.1                                          0.2282217
## AAACATACGATGAA.1                                         -0.2649363
## AAACATACGGCATT.1                                          0.2631560
FeaturePlot(ifnb, features="GO.0045655_regulation_of_monocyte_differentiation1")

SpatialFeaturePlot() scaling

Another annoying bug I have come across is trying to plot features on multiple slices of spatial resolved data. Using SpatialFeaturePlot() on a spatial object with more than one slice does not display the feature expression on the same scale for each slice. This is makes comparisons between the slices impossible.

Below is a demonstration of this with a 10x Visium data set from our lab. It consists of cardiac tissue 24 h after sham surgery or ischemia reperfusion (IR). It is available for you to download here if you want to fully reproduce this notebook. It has been processed with an integrated workflow using sctransform normalization, which is documented here.

# Load data
load("data/hearts.Rdata")
hearts
## An object of class Seurat 
## 51464 features across 4565 samples within 3 assays 
## Active assay: SCT (16179 features, 6000 variable features)
##  2 other assays present: Spatial, integrated
##  2 dimensional reductions calculated: pca, umap
##  2 images present: Sham, IR
# Plot feature expression using the "Spatial" assay
DefaultAssay(hearts) <- "Spatial"
SpatialFeaturePlot(hearts, features = "S100a8")

The scales are different between slices! It doesn’t matter which assay or data slot you use, it is always different. So you need to use a custom function to calculate a common scale of expression if you want to compare slices.

# Load in my custom function that equalizes scales
source("functions/SpatialFeaturePlotScaled.R")

# Plot expression using equal scales
SpatialFeaturePlotScaled(object = hearts,
                         group = "surgery",
                         group.1 = "Sham",
                         group.2 = "IR",
                         feature_of_interest = "S100a8",
                         from.meta.data = F,
                         group.1.title = "Sham",
                         group.2.title = "IR")

Improvements

Best practice warnings

Seurat workflows have been consistently expanded to include new analysis techniques such as sctransform normalization and integration. However, when using these techniques it’s not always clear to new users that they create corrected expression data, and the results are not suited for some downstream analyses. Specifically differential gene expression analysis and visualization.

It’s an accepted best practice that differential gene expression, including cluster marker identification, should be performed on the normalized, un-corrected data layer. When running sctransform or integration workflows, the default Seurat object assay is set to “SCT” or “integrated” respectively. The user must then reset the default assay to “RNA”, and perform normalization and scaling before performing marker detection, differential gene expression and feature plotting. This step is easily omitted as it’s not part of the integration and sctransform vignettes, which may lead to spurious results.

This has been discussed many times on their github e.g. #2906, and the developers have recently released an updated SCTransform v2 that can be used for differential gene expression.

But I believe it would be a worth-while improvement to include warnings within the FeaturePlot(), VlnPlot(), FindAllMarkers() and FindMarkers() functions when using the wrong assay/slot.

Below is a demonstration of how easy it is to plot using corrected data, since there is no feedback on which assay you are using, and no warning for using an inappropriate assay.

# Set default assay to integrated, this happens by default after integration
DefaultAssay(ifnb) <- "integrated"
p1 <- FeaturePlot(ifnb, "GNLY") +  labs(subtitle = "integrated assay") +
  theme(plot.title = element_text(hjust = 0))

# Set default assay to RNA (we normalized and scaled this data earlier)
DefaultAssay(ifnb) <- "RNA"
p2 <- FeaturePlot(ifnb, "GNLY") + labs(subtitle = "RNA assay") +
  theme(plot.title = element_text(hjust = 0))

# Compare
p1 + p2

This also extends to differential expression analysis. Here we find differentially expressed genes in the dendritic cell cluster (DC) between stimulated and control groups. There is no warning when we are performing this on the integrated assay and the genes returned are quite different than those calculated with the RNA assay.

# Set default assay to integrated, this happens by default after integration
DefaultAssay(ifnb) <- "integrated"
DC_dge_int <- FindMarkers(ifnb, 
                          subset.ident = "DC",
                          group.by = "stim",
                          ident.1 = "STIM",
                          ident.2 = "CTRL")

# Show top 5 differentially expressed genes calculated with the integrated assay
head(DC_dge_int, 5)
##                     p_val avg_log2FC pct.1 pct.2    p_val_adj
## SMAD7        1.035857e-72   2.124868 0.350 0.016 2.071715e-69
## RP11-390P2.4 1.630071e-72 -13.717893 0.977 0.008 3.260143e-69
## DOLK         2.649048e-70  -3.767908 0.804 0.019 5.298096e-67
## NDUFAF5      1.790260e-69  -6.891424 0.696 0.023 3.580521e-66
## UBE2D4       1.297769e-67  -3.461881 0.921 0.027 2.595538e-64
# Set default assay to RNA (we normalized and scaled this data earlier)
DefaultAssay(ifnb) <- "RNA"
DC_dge_rna <- FindMarkers(ifnb, 
                          subset.ident = "DC",
                          group.by = "stim",
                          ident.1 = "STIM",
                          ident.2 = "CTRL")

# Show top 5 differentially expressed genes calculated with the RNA assay
head(DC_dge_rna, 5)
##                p_val avg_log2FC pct.1 pct.2    p_val_adj
## IFIT3   2.087353e-90   5.172775 0.991 0.023 2.933358e-86
## IFIT1   1.053375e-89   5.070689 0.981 0.012 1.480308e-85
## IFIT2   1.409326e-89   4.501680 0.981 0.012 1.980525e-85
## TNFSF10 3.122225e-89   5.156429 0.981 0.019 4.387663e-85
## CXCL10  9.134153e-89   7.348824 0.977 0.012 1.283622e-84

Renaming idents

When you have finally annotated your dataset, and go to rename the identities from the numbered clusters to your annotations, you can use the RenameIdents() function. However, this renaming is only temporary and is not recorded in the object unless you manually stash them yourself.

I think an improvement to this function would be to include the new idents directly into the metadata and add an argument for what their name should be. Instead you have to do something like below:

# Set the idents back to cluster numbers for this example
Idents(ifnb) <- "seurat_clusters"
levels(ifnb)
##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"
# Rename them with annotations
ifnb <- RenameIdents(ifnb,
                    "0" = "Dantes",
                    "1" = "Pequods",
                    "2" = "Peice",
                    "3" = "Burts",
                    "4" = "Coalfired",
                    "5" = "Pats",
                    "6" = "BoilerRoom",
                    "7" = "SpaccaNapoli",
                    "8" = "VitoandNicks",
                    "9" = "BillsPub",
                    "10" = "GinosEast",
                    "11" = "Aurelios")
levels(ifnb)
##  [1] "Dantes"       "Pequods"      "Peice"        "Burts"        "Coalfired"   
##  [6] "Pats"         "BoilerRoom"   "SpaccaNapoli" "VitoandNicks" "BillsPub"    
## [11] "GinosEast"    "Aurelios"
# Store annotations in the metadata for use later
ifnb@meta.data$pizza_annotation <- Idents(ifnb)
head(ifnb@meta.data)
##                   orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
## AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
## AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
## AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
## AAACATACGGCATT.1 IMMUNE_CTRL       1581          557 CTRL          CD14 Mono
##                  nCount_SCT nFeature_SCT integrated_snn_res.0.4 seurat_clusters
## AAACATACATTTCC.1       2066          847                      3               3
## AAACATACCAGAAA.1       1923          713                      0               0
## AAACATACCTCGCT.1       2074          817                      3               3
## AAACATACCTGGTA.1       2230         1076                     11              11
## AAACATACGATGAA.1       1778          634                      1               1
## AAACATACGGCATT.1       1624          557                      3               3
##                   interferon1 GO.0007049_cell_cycle1
## AAACATACATTTCC.1 -0.330894340           0.0002443749
## AAACATACCAGAAA.1 -0.091329258          -0.0133844397
## AAACATACCTCGCT.1 -0.091645183          -0.0019225084
## AAACATACCTGGTA.1 -0.246427288           0.0169803778
## AAACATACGATGAA.1 -0.188778103           0.0167430460
## AAACATACGGCATT.1 -0.003577001          -0.0102648580
##                  GO.0030198_extracellular_matrix_organization1
## AAACATACATTTCC.1                                    0.02668278
## AAACATACCAGAAA.1                                    0.11640372
## AAACATACCTCGCT.1                                    0.03652001
## AAACATACCTGGTA.1                                   -0.04620496
## AAACATACGATGAA.1                                   -0.03353290
## AAACATACGGCATT.1                                    0.04570393
##                  GO.0034340_response_to_type_I_interferon1
## AAACATACATTTCC.1                               -0.27089826
## AAACATACCAGAAA.1                               -0.05460896
## AAACATACCTCGCT.1                               -0.05478965
## AAACATACCTGGTA.1                               -0.18274229
## AAACATACGATGAA.1                               -0.12928279
## AAACATACGGCATT.1                                0.04089427
##                  GO.0045655_regulation_of_monocyte_differentiation1
## AAACATACATTTCC.1                                          0.1880956
## AAACATACCAGAAA.1                                          0.1126325
## AAACATACCTCGCT.1                                          0.1026128
## AAACATACCTGGTA.1                                          0.2282217
## AAACATACGATGAA.1                                         -0.2649363
## AAACATACGGCATT.1                                          0.2631560
##                  pizza_annotation
## AAACATACATTTCC.1            Burts
## AAACATACCAGAAA.1           Dantes
## AAACATACCTCGCT.1            Burts
## AAACATACCTGGTA.1         Aurelios
## AAACATACGATGAA.1          Pequods
## AAACATACGGCATT.1            Burts

Spatial data interaction

While Seurat does allow users to subset out anatomical regions, currently this is based only on the xy coordinates of the spots, which makes this extremely tedious to section out biologically relevant areas. If users could open a shiny application and draw their regions of interest for sub-setting, this would allow a must easier and faster way to make analyses on regions of interest.

# Entire IR heart
p1 <- SpatialDimPlot(hearts, images = "IR", crop = F) + labs(title = "Full")

# Subset IR heart based on coordinates
hearts_sub <- subset(hearts, ir_imagerow > 800 | ir_imagecol < 380, invert = TRUE)
p2 <- SpatialDimPlot(hearts_sub, images = "IR", crop = F)+ labs(title = "Subset")

# Compare
p1 + p2

Similar to what is done in Napari would be nice, below is an example taken from their tutorial.

DGE expansion

While there are many methods available to employ for differential gene expression (DGE) analysis within the Seruat function FindMarkers(), a recent publication evaluated the performance of single-cell based differential gene expression versus pseudo-bulk approaches that employ the DEseq2/edgeR frameworks for experiments with biological replicates. The authors found that single-cell methods (like the wilcoxon rank sum default in FindMarkers()) do not accurately account for the variability of biological replicates and therefore produce hundreds of false positives.

One major improvement to the DGE analysis would be to include a pseudo-bulk method as an option.

Another quality of life addition to their DGE tools would be a function for plotting of differentially expressed genes, like a volcano plot. Below is an example.

# Initialize a list to store results
genes <- list()

# Switch back to biologically meaningful annotations
Idents(ifnb) <- "seurat_annotations"

# Loop through each ident, calculate DGE betwen control and stimulated samples
for (i in levels(Idents(ifnb))) {
  results <- FindMarkers(ifnb,
                         subset.ident = i,
                         group.by = "stim",
                         ident.1 = "STIM",
                         base = 2,
                         logfc.threshold = 0,
                         densify = T)
  results$cluster <- i
  results$gene <- row.names(results)
  row.names(results) <- NULL
  results$regulation <- ifelse(results$avg_log2FC > 0, "Up", "Down")
  genes[[i]] <- results
}

# Bind the list of results together
dge_no_threshold <- do.call(rbind, genes)

# Remove rownames, genes have been stored as a column
rownames(dge_no_threshold) <- NULL

# Load in my custom VolcanoPlot function
source("functions/VolcanoPlot.R")

# Plot the differentially expressed genes in the DC cluster
# Grey points are genes that are not significantly regulated
# Blue points are significantly downregulated genes
# Red Points are significantly upregualted genes
# Top 20 genes based on fold change in each direction are labeled with their symbol
VolcanoPlot(df = dge_no_threshold, identity = "DC",
            title = "Differentially expressed genes in stimulated DCs")

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.0.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggrepel_0.9.2               reticulate_1.26            
##  [3] scater_1.26.1               scuttle_1.8.1              
##  [5] SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
##  [7] Biobase_2.58.0              GenomicRanges_1.50.1       
##  [9] GenomeInfoDb_1.34.3         IRanges_2.32.0             
## [11] S4Vectors_0.36.0            BiocGenerics_0.44.0        
## [13] MatrixGenerics_1.10.0       matrixStats_0.63.0         
## [15] knitr_1.41                  forcats_0.5.2              
## [17] stringr_1.5.0               dplyr_1.0.10               
## [19] purrr_0.3.5                 readr_2.1.3                
## [21] tidyr_1.2.1                 tibble_3.1.8               
## [23] ggplot2_3.4.0               tidyverse_1.3.2            
## [25] SeuratDisk_0.0.0.9020       stxKidney.SeuratData_0.1.0 
## [27] stxBrain.SeuratData_0.1.1   pbmc3k.SeuratData_3.1.4    
## [29] ifnb.SeuratData_3.1.0       SeuratData_0.2.2           
## [31] SeuratObject_4.1.3          Seurat_4.3.0               
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.3            scattermore_0.8          
##   [3] bit64_4.0.5               irlba_2.3.5.1            
##   [5] multcomp_1.4-20           DelayedArray_0.24.0      
##   [7] data.table_1.14.6         RCurl_1.98-1.9           
##   [9] generics_0.1.3            metap_1.8                
##  [11] ScaledMatrix_1.6.0        cowplot_1.1.1            
##  [13] TH.data_1.1-1             RANN_2.6.1               
##  [15] future_1.29.0             bit_4.0.5                
##  [17] tzdb_0.3.0                mutoss_0.1-12            
##  [19] spatstat.data_3.0-0       xml2_1.3.3               
##  [21] lubridate_1.9.0           httpuv_1.6.6             
##  [23] assertthat_0.2.1          viridis_0.6.2            
##  [25] gargle_1.2.1              xfun_0.35                
##  [27] hms_1.1.2                 jquerylib_0.1.4          
##  [29] evaluate_0.18             promises_1.2.0.1         
##  [31] fansi_1.0.3               dbplyr_2.2.1             
##  [33] readxl_1.4.1              igraph_1.3.5             
##  [35] DBI_1.1.3                 htmlwidgets_1.5.4        
##  [37] spatstat.geom_3.0-3       googledrive_2.0.0        
##  [39] ellipsis_0.3.2            backports_1.4.1          
##  [41] deldir_1.0-6              sparseMatrixStats_1.10.0 
##  [43] vctrs_0.5.1               here_1.0.1               
##  [45] ROCR_1.0-11               abind_1.4-5              
##  [47] cachem_1.0.6              withr_2.5.0              
##  [49] progressr_0.11.0          vroom_1.6.0              
##  [51] sctransform_0.3.5         goftest_1.2-3            
##  [53] mnormt_2.1.1              cluster_2.1.4            
##  [55] lazyeval_0.2.2            crayon_1.5.2             
##  [57] hdf5r_1.3.7               spatstat.explore_3.0-5   
##  [59] pkgconfig_2.0.3           labeling_0.4.2           
##  [61] qqconf_1.3.0              nlme_3.1-160             
##  [63] vipor_0.4.5               rlang_1.0.6              
##  [65] globals_0.16.2            lifecycle_1.0.3          
##  [67] miniUI_0.1.1.1            sandwich_3.0-2           
##  [69] mathjaxr_1.6-0            modelr_0.1.10            
##  [71] rsvd_1.0.5                rprojroot_2.0.3          
##  [73] cellranger_1.1.0          polyclip_1.10-4          
##  [75] lmtest_0.9-40             Matrix_1.5-3             
##  [77] zoo_1.8-11                reprex_2.0.2             
##  [79] beeswarm_0.4.0            ggridges_0.5.4           
##  [81] googlesheets4_1.0.1       png_0.1-8                
##  [83] viridisLite_0.4.1         bitops_1.0-7             
##  [85] KernSmooth_2.23-20        DelayedMatrixStats_1.20.0
##  [87] parallelly_1.32.1         spatstat.random_3.0-1    
##  [89] beachmat_2.14.0           scales_1.2.1             
##  [91] magrittr_2.0.3            plyr_1.8.8               
##  [93] ica_1.0-3                 zlibbioc_1.44.0          
##  [95] compiler_4.2.2            RColorBrewer_1.1-3       
##  [97] plotrix_3.8-2             fitdistrplus_1.1-8       
##  [99] cli_3.4.1                 XVector_0.38.0           
## [101] listenv_0.8.0             patchwork_1.1.2          
## [103] pbapply_1.6-0             MASS_7.3-58.1            
## [105] tidyselect_1.2.0          stringi_1.7.8            
## [107] highr_0.9                 yaml_2.3.6               
## [109] BiocSingular_1.14.0       grid_4.2.2               
## [111] sass_0.4.4                tools_4.2.2              
## [113] timechange_0.1.1          future.apply_1.10.0      
## [115] parallel_4.2.2            rstudioapi_0.14          
## [117] gridExtra_2.3             farver_2.1.1             
## [119] Rtsne_0.16                digest_0.6.30            
## [121] shiny_1.7.3               Rcpp_1.0.9               
## [123] broom_1.0.1               later_1.3.0              
## [125] RcppAnnoy_0.0.20          httr_1.4.4               
## [127] Rdpack_2.4                colorspace_2.0-3         
## [129] rvest_1.0.3               fs_1.5.2                 
## [131] tensor_1.5                splines_4.2.2            
## [133] uwot_0.1.14               sn_2.1.0                 
## [135] spatstat.utils_3.0-1      sp_1.5-1                 
## [137] multtest_2.54.0           plotly_4.10.1            
## [139] xtable_1.8-4              jsonlite_1.8.3           
## [141] R6_2.5.1                  TFisher_0.2.0            
## [143] pillar_1.8.1              htmltools_0.5.3          
## [145] mime_0.12                 glue_1.6.2               
## [147] fastmap_1.1.0             BiocParallel_1.32.4      
## [149] BiocNeighbors_1.16.0      codetools_0.2-18         
## [151] mvtnorm_1.1-3             utf8_1.2.2               
## [153] lattice_0.20-45           bslib_0.4.1              
## [155] spatstat.sparse_3.0-0     numDeriv_2016.8-1.1      
## [157] ggbeeswarm_0.6.0          leiden_0.4.3             
## [159] survival_3.4-0            limma_3.54.0             
## [161] rmarkdown_2.18            munsell_0.5.0            
## [163] GenomeInfoDbData_1.2.9    haven_2.5.1              
## [165] reshape2_1.4.4            gtable_0.3.1             
## [167] rbibutils_2.2.10